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Continuation of Substance of Interview including description of the general nature of what was agreed to if an 
agreement was reached, or any other comments: Dr. Schneider provided a tutorial on the background of the invention 
and the application of the cumulative ranking algorithm. Dr. Schneider indicated that prior art methods would store the 
m/z values for the first 5 possible combinations of amino acids, but that the values become to voluminous and 
incapable of storage on a computer when one starts looking at the 6th position, or even further out. 
Dr. Schneider stated how the values of the m/z charges observed/measured are not stored, but rather, the values are 
calculated anew for each fragment as it passes into the detection chamber of the mass spec. Dr. Schneider indicated 
that only the mean and standard deviation values are stored in the computer, and that the further one sequences, the 
greater the confidence interval one has as to the sequence of the preceding amino acid residues. However, the 
confidence interval is not necessarily all that great when one has reached the end of a 5-mer or 6-mer as there are no 
additional residues upon which to base m/z values. 

Dr. Schneider indicated that in some instances you have the same, or nearly the same, m/z value for different 
fragments. The aspect of not being able to identify Leu/lle was specifically discussed. Dr. Schneider indicated that 
they adapted a genetic algorithm to the current application. 

Dr. Schneider indicated that a mass spec is limited as to just how far it can see, noting that the mass spec used in the 
examples could only see out 6 amino acids. Dr. Schneider identified 'Top Down Sequencing" as being developed post 
filing, and that employing this post filing technology, one can see out to 50 amino acid residues. 
In response to inquiry by Mr. Sisson as to the ability to accurately sequence any polymer, and citing the 1-4, or 1-6 
glycosidic linkages found in polysaccharides, Dr. Schneider stated that using mass spec, one cannot identify the 
linkage nor identify the saccharide beyond its being a hexose. Dr. Schneider added that post filing, they have gone to 
using PEG (polyethylene glycol) as a standard in MALDI mass spec. 

Dr. Schneider indicated that sequencing is "probabilistic" and that the top ranked sequence may not be the correct 
sequence, directing attention to Example 13 of the specification. The aspect of not being able to identify the correct 
sequence when the target is an unknown was discussed. Dr. Schneider indicated that one could take the ranked 
sequencing values and then go search databases to se if something like it was known. 
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Townsend and Townsend and Crew LLP 

12730 High Bluff Drive Suite 400 

San Diego, CA 92130 

858.350.6100 

kejenki ns@townsend . co m 
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Algorithms for de novo peptide sequencing via tandem 



mass spectrometry 

Bingwen Lu Ting Chen * 

Abstract 

There is growing interest in the qualitative and quantitative analysis of proteins 
on a pro teome- wide scale. Mass spectrometry plays an important role in the high- 
throughput environments of proteomics study. There have been two major developments 
for mass spectrometry technology: (1) instrumental (including physiochemical) devel- 
opment, such as the development of cliromatography teclmology, the development of 
protein ionization technology, and the development of protein-labeling technology, and 
(2) the development of computational algorithms for analysis of mass spectra produced 
by mass spectrometers. In this review, we will give an overview of the development of 
computational algorithms for de novo peptide sequencing using tandem mass spectrom- 
etry. 

* Corresponding author. Address: Molecular and Computational Biology Program, Depart.ment of Biolog- 
ical Sciences, University of Southern California, Los Angeles, CA 90089 USA. Email: tingcJien@hto.usc.edu. 
Tel: 213-740-2415. 
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Keywords: de novo peptide sequencing, mass spectrometry, proteomics, graph the- 
ory, dynamic programming, suhoptimal algorithms 

Teaser: Special focus on de novo peptide sequencing: a useful technique for pro- 
teomics study and peptide drug discovery 

Mass spectrometry and proteomics 

Mass spectrometry is an indispensable tool in the era of modern proteomics research. Mass 
spectrometry has been applied successfully to the study of proteins in the proteome-wide 
manner {e.g., see [1-6]). To take one example, Gavin et aL [3] used tandem affinity purifica- 
tion (TAP) and mass spectrometry to characterize multi-protein complexes in Saccharomyces 
cerevisiae. They processed 1,739 genes and purified 589 protein assemblies. Their analysis 
of these assemblies revealed 232 distinct multi-protein complexes, and they proposed cellular 
roles for 344 proteins, among which 231 had no previous functional annotation. To take a 
second example. Ho et aL [4] employed a method called high-throughput mass spectromet- 
ric protein complex identification (HMS-PCI) to systematically identify protein complexes in 
Saccharomyces cjtrevisiae. Starting with 10% of the predicted proteins, they detected 3,617 
associated proteins covering 25% of the yeast proteome. A third example is the analysis by 
Petricoin et aL [5] of the mass spectrometry-generated proteomic pattern in serum for ovarian 
cancer diagnosis. A training set of mass spectra generated from 50 unaffected women and 50 
ovarian cancer patients were analyzed by an artificial-intelligence algorithm to discover a pro- 
teomic pattern that could completely discriminate cancer samples from normal samples. The 
identified pattern was then used to classify masked serum samples. The authors successfully 
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identified all 50 ovarian cancer samples, while for the 66 cases of non-ovarian cancer 63 were 
classified as non-cancer. They computed the positive predictive value for this validation to be 
94%. This is quite comparable to another ovarian cancer diagnostic measure called CA125, 
which has a positive predictive value of 34%. 

Protein sequencing and identification by tandem mass 
spectrometry 

In a routine protein sequencing and identification by tandem mass spectrometry, the protein 
or proteins are first digested by some enzymes such as trypsin. The resulting peptides are then 
separated by liquid chromatography (LC) and subsequently analyzed by a mass spectrometer. 
The peptides are ionized, and the mawSS-to-charge (m/z) ratios are measured. For tandem mass 
spectrometry, ions within some range of specific m/z ratios are further selected, and a second 
round of fragmentation and m/z measuring is performed. The resulting tandem mass spectra, 
comprising m/z ratios and corresponding intensities, are then used for the identification of the 
original protein or peptide. There are two principal ways of doing this kind of identification. 
One is to correlate the spectra with protein sequences or nucleic acid sequences by database 
searching [7-12]. The other is to derive the peptide sequence directly, without the help of any 
sequence database. The latter method is called de novo peptide sequencing, which we will 
discuss extensively below. 
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The de novo peptide sequencing problem and the scoring 
function 

The de novo peptide sequencing problem, given a tandem mass spectrum, is that of 
seeking a peptide that can best explain the spectnim according to some scoring function. 
This requires us to discuss the meaning of a scoring function. Basically, for a given real 
experimental spectrum, we can find some candidate peptides for this spectrum. We will 
discuss how to find the candidate peptides in later sections. Here, we will assume that the 
candidate peptides are obtained. For each candidate peptide, a hypothetical spectrum can be 
generated in silico^ based on the assumption of some fragmentation patterns and frequencies. 
Different fragmentation patterns will generate different ion types. The usual ion types are 
b-ions and y-ions, as shown in Figure 1. The scoring function then measures the similarity 
of the real spectrum to the hypothetical spectrum of each candidate peptide. The candidate 
peptide associated with the hypothetical spectrum that shows the most similarity with the 
real spectrum is then reported as the best candidate peptide that explains the real spectrum. 
Sometimes the similarity score is associated with a p-value, which gives a probability that the 
score is achieved by random chance. The design of a good scoring function is never an easy 
task and is an active research area in which considerable efforts have been made to find a 
good scoring function [7-14]. 

To illustrate the abstract description of the scoring function above, we will give a brief 
description of one of the scoring methods, a four-step process called the SEQUEST algorithm 
[7]. The algorithm begins (step 1) with tandem mass spectrometry data reduction. In this 
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step, fractional mass-to-charge ratios are rounded to the nearest integers, and then a 10-u win- 
dow around the precursor ion is removed to avoid matching to the unfragmented precursor 
ion. The 200 most abundant ions are selected for scoring purposes. In step 2, the candidate 
peptide sequences are then identified by matching the precursor peptide masses to the masses 
of all possible peptides in a protein database. Peptides with ±3 u or ±1 u are selected as 
candidate peptides. One hypothetical spectrum is then generated for each candidate peptide, 
and these hypothetical spectrums are compared with the real spectrum to produced a prelim- 
inary ranked list of 500 best-fit sequences (step 3). This preliminary ranked list considers the 
number of matching ions within the mass tolerance of ±1 u, the abundances of the matching 
ions, the continuity of an ion series, and the presence of immonium ions for the amino acids 
His, Tyr, Trp, Met, and Phe. Finally, in step 4, these 500 sequencers are then subject to a 
cross-correlation analysis to generate the final ranked list. 



Rationales for de novo sequencing 

Now, let us return to the rationales for doing de novo peptide sequencing. Usually, searching 
against a sequence database is the first choice for peptide identification, because the candidate 
peptides can be found from the database. However, de novo peptide sequencing comei? into 
play in various situations. First, the protein of interest might not be present in the sequence 
database. One case is that the sequence database is incomplete, which is the situation for many 
model animals and plants. Another case is that the protein of intere^st might be a novel protein, 
which might appear when pharmaceutical scientists try to screen for new drugs from synthetic 
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proteins. Second, there are prediction errors in gene-finding programs. Thus, it might not be 
possible to find the true protein from the predicted protein database. Third, wsome scientists 
might want to study the proteome before the genome, in which case no sequence database 
might be available. In a fourth situation, genes might undergo alternative splicing, which 
would result in novel proteins. The occurrence of single nucleotide polymorphisms (SNPs) in 
coding regions may also lead to different protein variants. In a fifth case, de novo sequencing 
can be helpful for studying amino acid mutations and protein modifications. Finally, when a 
database search generates ambiguous results, de novo sequencing can be used as a validation 
tool. 

Early development of de novo sequencing algorithms 

Over the years, various algorithms have been developed to address the de novo sequencing 
problem. One naive method [15, 16], is to list all possible candidate peptides according to 
the mass of the parent ion of the tandem mass spectrum. This is sometimes called exhaustive 
listing. All of the candidate peptides are then compared with the real spectrum to find out 
which one is the best match. One computational diflSculty inherent in this approach is that 
there will most likely be a large number of possible candidates for a typical parent mass 
that may range from below one thousand to several thousand Daltons [15, 16]. For example, 
as described in R,eference 15, for a parent peptide of molecular weight of 774, there will be 
21,909,046 possible candidate peptides. 

An alternative approach, sometimes called "subsequencing," has also been tried on mass 
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spectra data generated by various mass spectrometers [17-20]. In this approach, short se- 
quences that represent only a portion of the whole sequence are tested against the real spec- 
trum. Those subsequences that account for some observed ions are then extended one residue 
at a time until the whole sequence is tested. During the subsequence extension, only those 
subsequences that have significant matching with the real spectrum are retained. One disad- 
vantage of this approach is that some good candidate peptides might be discarded when some 
regions of a peptide are less represented by fragmentation ions. It is important to be aware 
that the fragmentation frequencies of a peptide are usually not evenly distributed over the 
whole peptide. 

A third method employs the use of graphical display of the data [21]. In this method, 
fragmentation ions that differ by the mass of one amino acid are represented by connected 
lines, thus allowing the visualization of ion series of the same type. Such an approach is not 
quite suitable for high-throughput environments, but this method can be quite helpful for 
manual de novo interpretation of tandem mass spectra. 

A fourth approach uses graph theory [22-25, 13]. This approach has been proven to be 
quite successful and will be discussed in the next seijsion. 

Application of graph theory in mass spectrometry 

An application of graph theory in de novo peptide sequencing was first proposed by Bartels 
[22]. The basic idea is to transform a real spectrum into a graph called ''Spectrum GrapK\ 
For the transformation, each peak in the real spectrum is represented as a vertex (or several 
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vertices) in the spectrum graph, and a directed edge is established between two vertices if the 
mass difference of the two vertices equals the mass of some amino acids. Various algorithms 
have been designed to find paths in the spectrum graph in which the corresponding peptides 
provide a good explanation of the real spectrum. Some algorithms of this type are introduced 
below 

The Lufefisk algorithm* The Lutefisk algorithm was designed by Taylor and Johnson 
[25]. In this algorithm, the authors first reduced the real spectrum data to a list of significant 
fragment ions. Then they determined the N- and C-terminal evidence lists, which provide the 
evidence for the possible N-terminal and C-terminal ions, respectively. After the N-terminal 
and C-terminal evidence lists were obtained, a "sequence spectrum" was formed, in which 
the X ordinate consisted of the m/z values for the b-ions and the y ordinate consisted of the 
probability of cleavage at each site. The program then proceeded by tracing out sequences, 
starting from the N-terminus, by finding b-ion values that differed from the N-terminus by 
the mass of one or several amino acids. After all the sequences had been obtained, a scoring 
procedure was carried out to rank the sequences. 

The SHERENGA algorithm. The SHERENGA algorithm was developed by Dancik 
et ai [13]. Because the creation of a spectrum graph is based on the ion types, the authors 
designed a method to automatically learn ion types from a training set of experimental spectra 
of known sequences, without knowing a priori the fragmentation patterns. After the ion types 
were learned, they transformed the real spectrum into a spectrum graph using the following 
steps. First, the ion types were represented by A = {^i, where k is the number of ion 

types and each Si represents the offset of the corresponding ion type. The real spectrum S was 
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then transformed into a spectrum graph G^{S) as follows. Each peak s of the real spectrum 
S generated k vertices V{s) = {.9 + (5i, .,.,.9 + 4}- Two vertices, u and were then connected 
by a directed edge from u to v if v - u equaled the mass of some amino acid. The peptide 
sequencing problem was then cast as the longest path problem in a directed acyclic graph. The 
authors also pointed out that the longest path may correspond to unrealistic solutions because 
it may use multiple vertices associated with the same real spectral peak. One solution is to 
find the longest anti-symmetric path. They claimed there exists an efficient algorithm to find 
the anti-symmetric longest path in the spectrum graph. 



Dynamic programming and suboptimal concept 

As pointed out by Dancik et al [13], the problem of finding the longest path in a directed 
acyclic spectrum graph while avoiding multiple assignments to the same peak is NP-complete 
in the general case [26]. However, Dancik et al. [13] and Chen et al [27] observed that there 
is a special structure for forbidden pairs of vertices (twins) in the spectrum graph. That is, 
the forbidden pairs are non-interleaving. Two forbidden pairs of vertices and (.^2,15/2) 

are non-interleaving if the intervals {x\^y^) and (.7:2,3^2) do not interleave. Chen et al. [27] 
then proposed a dynamic programming approach to find the anti-symmetric longest path in 
the spectrum graph. 

Dynamic programming is a common technique in solving optimization problems [26]. In 
dynamic programming, an optimization problem is solved in a bottom- up fashion, through 
combining the solutions to sub-problems, which gives an optimal solution for the original 



9 



problem, Chen et al [27], employing dynamic programming, provided a polynomial time 
algorithm for the de novo sequencing problem using a spectrum graph. Following is an out- 
line for the algorithm. First, an NC-spectrum graph G ("NC" denotes "N-terminal and 
C-terminal") is constructed from the real spectrum, by assuming that each peak in the real 
spectrum can be a b-ion or a y-ion. Thus, each peak in the real spectrum will generate two 
vertices in the NC-spectrum graph G. All of the vertices are then placed on the real line 
at positions corresponding to the mass vahies of the vertices. If the mass difference between 
two vertices u and v equals the total mass of some amino acid residues, a directed edge is 
drawn between u and pointing from the low-mass vertex to the high-mass vertex. To give 
the reader a feel for how an NC-spectrum graph can be generated from a mass spectrum, 
an example is given in Figure 2. Next, the nodes of G are renamed in an order from left to 
right as .To, .ri , . . . , x^, J/fc, • • - , yi , where every pair, Xi and yi, 1 < i < A:, corresponds to two 
mutually exclusive assumptions of the same mass peak. The dynamic programming algorithm 
then finds a path with the maximum path score from .tq to jt/o that contains the edge (.Ti, 
i # .7- 

The dynamic programming algorithm will find the optimal solution. However, the optimal 
solution may not be the real sequence that produces the real spectrum. Even database search 
programs sometimes report several sequences with similar scores because the scoring function 
can misinterpret the spectral data. Noise and unknown ions may also be interpreted as 
real ions by the programs. For these reasons, the suboptimal solutions are of great interest 
because they might give us the real sequence that generated the spectrum. Lu and Chen 
[28] further explored this application of suboptimal solutions in the dynamic programming 
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algorithm. In this siiboptimal algorithm, a real spectrum with k peaks is transformed into a 
matrix spectrum graph G = (V^E)^ where \V\ = 0{k^) and = 0{k^). A polynomial time 
siiboptimal algorithm was then proposed to find all of the suboptimal solutions (peptides) in 
0{p\E\) time, where p is the number of solutions. 

Inspired by the work of Dancik et al [13] and Chen [27], two more research groups also 
proposed dynamic algorithms to solve the de novo peptide sequencing problem [29, 30]. The 
work by Bafna and Edwards [29] also considered the suboptimal solutions in their dynamic 
programming machinery. 

For interested readers, a list of de novo peptide sequencing programs is listed in Table 1. 

Concluding remarks 

Mass spectrometry has become an important tool for proteomics study. Normally, search- 
ing against a sequence database is the first choice for protein identification. However, de 
novo sequencing comes into play in various situations. Over the years, numerous computer 
algorithms have been developed for de novo peptide sequencing, and there are also manual 
methods for interpretation of mass spectra. For example, one strategy for the interpretation 
of mass spectra generated from tryptic peptides was presented by Kinter and Sherman [31]. 

At the same time, the de novo sequencing problem via tandem mass spectrometry is still 
not solved in general. In other words, the information contained in tandem mass spectrometry 
cannot be readily converted into a fully unambiguous peptide sequence. Usually, a de novo 
sequencing program has been designed for some machine-dependent tandem mass spectra and 
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thus is not universally applicable to spectra generated by other types of mass spectrometers. 
To our knowledge, none of the current de novo sequencing programs take into account internal 
fragmentation of the parent ion. Also, current de novo sequencing programs usually assume 
that all daughter ions have a -1-1 charge state, which might not be the real case. 

Nevertheless, the development of computational algorithms, including de novo peptide 
sequencing methods, database search algorithms, and other computational tools, together with 
mass spectrometric instrumental developments, would substantially enhance our capabilities 
in biological studies. 
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Figure 1: A peptide of n amino acids and possible fragmentation patterns. When the peptide 
is fragmented between the i-th and i + 1-th amino acids, three possible fragmentations can 
happen, yielding three pairs of ions (oj, Xn-i), j/n-i)? ^^nd (c^, 2n-t)> respectively. The 
a-ions, fe-ions and c-ions are N-terminal ions while x-ions, y-ions and 2;-ions are C-terminal 
ions. The most common ion types are 6-ions and j/-ions- 
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Figure 2: An NC-spectrum graph example, (a) An artificial tandem mass spectrum of the 
peptide NDEMK(617.25 Daltons). (b) A sparse NC-spectrum graph constructed from the 
spectrum shown in A. One of the paths running from A^o to Co is bolded and labelled with 
possible corresponding amino acids. The symbol means "and", while "/" means "or". 
For example, "E+M" means "E and M (order undetermined)", while "Q/K" means "Q or 
K". 
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Table 1 : A list of de novo peptide sequencing programs. 
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